A Gaussian distribution of residuals is more a destination than a departure point
Whenever we face a quantitative problem, there is one method we cannot avoid: Linear Regression. It is the foundation for many other tools, but its importance goes far beyond that.
In practice its properties and predictive power suggest that running a regression can be both a good jumping-off point and a deliverable product. Since it is the foundation stone, we can approach our problem by questioning and getting to know the data with "boring" classical statistical methods, and that will yield two important things:
- Insights about the data, which is useful for model selection.
- Roadmap to build an appropriate linear regression model.
In other words, by following the "Good Old Linear Regression Recipe" we will either have our model ready or have collected enough information about the data to choose the correct model. Not bad for the Oldie.
If you torture your data enough, it will confess what you want to hear
In most real-life cases, data will not speak out loud. But just asking questions is not enough: asking the right ones is far more useful than asking a plethora. A flexible routine may be:
The formula:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
Where the coefficients or parameters represent:
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
In reality these are unknown, so we denote the estimated parameters with a hat: \(\hat{\beta}_0, \hat{\beta}_1\).
A very important component are the residuals, defined as the difference between the observed value and the estimated value:
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \\ e_i = y_i - \hat{y}_i \]
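As a side note, for simple regression the least-squares estimates have a closed form, \(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). A minimal sketch on simulated data (not the Advertising dataset) confirms they match what `lm()` returns:

```r
# Closed-form least-squares estimates on simulated data
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)  # true beta_0 = 2, beta_1 = 3

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

fit <- lm(y ~ x)
# The hand-computed estimates match coef(fit) up to floating-point error
c(beta0_hat, beta1_hat)
coef(fit)
```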
For demonstration we load the Advertising dataset:
library(data.table)
library(ggplot2)
library(gridExtra)
Advertising <- fread(file = file.path('data', 'Advertising.csv'))
head(Advertising)
## V1 TV radio newspaper sales
## 1: 1 230.1 37.8 69.2 22.1
## 2: 2 44.5 39.3 45.1 10.4
## 3: 3 17.2 45.9 69.3 9.3
## 4: 4 151.5 41.3 58.5 18.5
## 5: 5 180.8 10.8 58.4 12.9
## 6: 6 8.7 48.9 75.0 7.2
str(Advertising)
## Classes 'data.table' and 'data.frame': 200 obs. of 5 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(Advertising)
## V1 TV radio newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75
## Median :100.50 Median :149.75 Median :22.900 Median : 25.75
## Mean :100.50 Mean :147.04 Mean :23.264 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
Then we fit a simple model regressing sales on TV and calculate the residuals:
model_tv <- lm(sales ~ TV, data = Advertising)
df_model_tv <- copy(Advertising) # copy() so that := below does not modify Advertising by reference
df_model_tv[, predicted_sales := predict(model_tv)]
df_model_tv[, residuals_hand := sales - predicted_sales]
df_model_tv[, residuals_model := model_tv$residuals]
Now that we have our model and its residuals we can plot them:
ggplot(df_model_tv,
aes(x = TV,
y = sales)) +
geom_point(col = "#7D6B7D", alpha = 0.5, size = 3) +
geom_segment(aes(xend = TV, yend = predicted_sales), alpha = 0.1, size = 1) +
stat_smooth(method = "lm", col = "#FF665A")
## `geom_smooth()` using formula 'y ~ x'
From the graph we can observe geometrically that:
- Observed values \(y_i\) are the circular points
- Predicted (fitted, estimated) values \(\hat{y}_i\) are the points lying on the red line
- Residuals \(e_i\) are the magnitudes of the segments, measured from the red line to each circular point
The Residual Sum of Squares is a measurement of model (in)accuracy, written as:
\[
RSS = e_{1}^2 + e_{2}^2 + e_{3}^2 + \dots + e_{n}^2
\]
The ordinary least squares method finds the coefficients that minimize the \(RSS\); the resulting estimator can be proved to be unbiased (it does not systematically under- or over-estimate the true values).
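To see the minimization at work, a small sketch on simulated data (not tied to any dataset used here): perturbing the fitted coefficients in any direction can only increase the RSS, since OLS is the unique minimizer of this quadratic criterion.

```r
# OLS attains a lower RSS than any perturbed coefficient pair
set.seed(7)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
rss_ols <- rss(coef(fit)[1], coef(fit)[2])

# Nudging either coefficient increases the RSS: both comparisons are TRUE
rss_ols <= rss(coef(fit)[1] + 0.1, coef(fit)[2])
rss_ols <= rss(coef(fit)[1], coef(fit)[2] - 0.1)
```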
The moments in a nutshell are:
- \(\hat{\mu}\) is an accurate estimate of the population mean, but
- the Standard Error (SE) tells us that \(Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\), which measures how far our estimate typically is from the population mean, and
- \(\sigma^2\) is the variance of the individual realizations \(y_i\) of \(Y\).
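The \(Var(\hat{\mu}) = \sigma^2/n\) identity is easy to verify empirically; a quick simulation sketch (the sample size and \(\sigma\) below are arbitrary choices for illustration):

```r
# Empirical check that Var(mu_hat) is close to sigma^2 / n
set.seed(1)
sigma <- 2
n <- 50
# 10000 sample means, each from n draws of a N(5, sigma^2) population
mu_hats <- replicate(10000, mean(rnorm(n, mean = 5, sd = sigma)))
var(mu_hats)     # close to the theoretical value below
sigma^2 / n      # 0.08
```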
To assess the model accuracy we look at two related quantities:
#### RSE: Residual Standard Error
- Roughly speaking, the RSE is the average amount that the response will deviate from the true regression line. It is an estimate of the standard deviation of \(\epsilon\)
- \(RSE = \sqrt{\frac{1}{n-2}RSS}\)
- It provides an absolute measure of lack of fit of the model in the same unit of measure as \(Y\)
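The RSE formula can be checked by hand against `summary()`: the `sigma` element of an `lm` summary is exactly \(\sqrt{RSS/(n-2)}\) for simple regression. A sketch on simulated data:

```r
# RSE by hand vs summary()$sigma
set.seed(3)
x <- rnorm(60)
y <- 4 - x + rnorm(60, sd = 1.5)
fit <- lm(y ~ x)

n <- length(y)
rse_hand <- sqrt(sum(resid(fit)^2) / (n - 2))
rse_hand
summary(fit)$sigma   # identical up to floating point
```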
When running a regression with multiple variables there are other questions to ask:
1. Is at least one of the predictors in \(X\) useful in predicting \(Y\)?
- We use the F-statistic \(F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}\). If F is close to 1 there is no relationship.
2. Do all predictors (or a subset of them) help to explain \(Y\)?
- There are several ways to address this question, depending on the data and the model usage.
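The F-statistic can likewise be reproduced by hand from \(TSS\) and \(RSS\) and compared against the value `summary()` reports. A sketch with two simulated predictors:

```r
# F-statistic by hand for a model with p = 2 predictors
set.seed(11)
n <- 100
p <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

tss <- sum((y - mean(y))^2)
rss <- sum(resid(fit)^2)
f_hand <- ((tss - rss) / p) / (rss / (n - p - 1))
f_hand
summary(fit)$fstatistic["value"]   # matches the hand computation
```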
When fitting with \(p\) predictors, the RSE formula varies slightly:
- \(RSE = \sqrt{\frac{1}{n-p-1}RSS}\)
plotly::plot_ly(data = df_model_tv,
                x = ~TV,
                y = ~radio,
                z = ~sales,
                type = "scatter3d",
                mode = "markers",
                marker = list(color = '#7D6B7D'))
There are two assumptions that are restrictive but can be relaxed:
If there is not even a pseudo-linear relationship between the predictor(s) and the response variable, then the world-famous "residual plots" will tell us.
For this case we will use the Auto dataset:
auto <- fread(file = file.path('data', 'Auto.csv'))
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1: 18 8 307 130 3504 12.0 70 1
## 2: 15 8 350 165 3693 11.5 70 1
## 3: 18 8 318 150 3436 11.0 70 1
## 4: 16 8 304 150 3433 12.0 70 1
## 5: 17 8 302 140 3449 10.5 70 1
## 6: 15 8 429 198 4341 10.0 70 1
## name
## 1: chevrolet chevelle malibu
## 2: buick skylark 320
## 3: plymouth satellite
## 4: amc rebel sst
## 5: ford torino
## 6: ford galaxie 500
str(auto)
## Classes 'data.table' and 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, ".internal.selfref")=<externalptr>
## horsepower is not numeric
auto[, horsepower := as.numeric(horsepower)]
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
auto = auto[!is.na(horsepower)]
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
model_auto_linear <- lm(mpg ~ horsepower, data = auto)
model_auto_quad <- lm(mpg ~ poly(horsepower, 2), data = auto) # poly(x, 2) already contains the linear term
model_auto_vr <- lm(log(mpg) ~ poly(horsepower, 2), data = auto)
df_model_auto <- copy(auto) # copy() so that := below does not modify auto by reference
df_model_auto[, predicted_linear := predict(model_auto_linear)]
df_model_auto[, predicted_quad := predict(model_auto_quad)]
df_model_auto[, predicted_vr := predict(model_auto_vr)]
df_model_auto[, residuals_linear := model_auto_linear$residuals]
df_model_auto[, residuals_quad := model_auto_quad$residuals]
df_model_auto[, residuals_vr:= model_auto_vr$residuals]
model_lin <-
ggplot(df_model_auto,
aes(x = horsepower,
y = mpg)) +
geom_point(col = "#7D6B7D", alpha = 0.5, size = 3) +
geom_line(aes(y = predicted_linear), colour = '#DEA800', size = 2) +
labs(title = latex2exp::TeX(r"($mpg ~ \beta_0 horsepower$)"))
resid_lin <-
ggplot(df_model_auto,
aes(x = predicted_linear,
y = residuals_linear)) +
geom_point(col = "#A8C0CE", alpha = 0.5, size = 3) +
labs(title = 'Linear Horsepower Residuals')
grid.arrange(model_lin, resid_lin, ncol = 2)
model_quad <-
ggplot(df_model_auto,
aes(x = horsepower,
y = mpg)) +
geom_point(col = "#7D6B7D", alpha = 0.5, size = 3) +
geom_line(aes(y = predicted_quad), colour = '#DEA800', size = 2) +
labs(title = latex2exp::TeX(r"($mpg ~ \beta_0 horsepower + \beta_1 horsepower^2$)"))
resid_quad <-
ggplot(df_model_auto,
aes(x = predicted_quad,
y = residuals_quad)) +
geom_point(col = "#A8C0CE", alpha = 0.5, size = 3) +
labs(title = 'Linear + Quadratic Horsepower Residuals')
grid.arrange(model_quad, resid_quad, ncol = 2)
As we can observe from the plots, even without a second-order term the model was not that bad, but the residuals plot clearly shows a quadratic pattern the model did not capture. In the second model, instead, it is hard to identify any visible pattern.
Recall that our model specifies an error term \(\epsilon_i\) that follows a Gaussian distribution; if its variance is not constant then a very important assumption is violated and the coefficient estimates will be misguided.
The measures to be taken to address this issue depend on how the variance varies. The most common remedy is transforming the response variable \(Y\), for instance \(ln(Y)\) or \(\sqrt{Y}\).
Recalling the model_auto_quad residuals plot, if we fit a new model with a \(log\) transformation we will see how the increasing-spread pattern disappears.
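One simple hand-rolled check for non-constant variance (a sketch, not the formal Breusch-Pagan test) is to regress the squared residuals on the fitted values: a clearly positive slope signals spread growing with the response. The data-generating process below is made up for illustration.

```r
# Rough heteroscedasticity check: squared residuals vs fitted values
set.seed(5)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)  # noise sd grows with x

fit <- lm(y ~ x)
e2 <- resid(fit)^2
aux <- lm(e2 ~ fitted(fit))
coef(aux)[2]   # slope should come out positive here: variance grows
```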
model_vr <-
ggplot(df_model_auto,
aes(x = horsepower,
y = log(mpg))) +
geom_point(col = "#7D6B7D", alpha = 0.5, size = 3) +
geom_line(aes(y = predicted_vr), colour = '#DEA800', size = 2) +
labs(title = latex2exp::TeX(r"($log(mpg) ~ \beta_0 horsepower$)"))
resid_vr <-
ggplot(df_model_auto,
aes(x = predicted_vr,
y = residuals_vr)) +
geom_point(col = "#A8C0CE", alpha = 0.5, size = 3) +
labs(title = 'Log-transformed + Quadratic Horsepower Residuals')
grid.arrange(model_vr, resid_vr, ncol = 2)
Another useful technique for some datasets is fitting the model through weighted least squares, especially on small to medium size datasets.
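A minimal sketch of weighted least squares via the `weights` argument of `lm()`: assuming (for this made-up example) that the noise standard deviation is proportional to \(x\), weights of \(1/x^2\) give every observation the same effective variance.

```r
# OLS vs WLS when noise sd is proportional to x
set.seed(9)
x <- runif(150, 1, 10)
y <- 1 + 2 * x + rnorm(150, sd = 0.5 * x)

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)

# Both estimators are unbiased; WLS is the more efficient one here
coef(ols)
coef(wls)
```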
Outliers per se may contain useful information, such as extreme events, special cases to consider, or data-collection anomalies, and hence deserve careful treatment. High-leverage values, in contrast, are often not taken seriously enough, given how strongly they can affect our model even when they are legitimate values.
Both can affect the model considerably, since many algorithms use the mean and variance for estimation, and we all know these statistics are sensitive to extreme values.
As always, plots may be useful to identify these values, but since we are dealing with single values rather than patterns, it is harder to tell at plain sight the difference between acceptably rare and problematically extreme. Numeric approaches are:
- For outliers: the studentized residual \(\frac{e_i}{SE(e_i)}\)
  - values greater than 3 in absolute value can be considered outliers
  - we remove extreme \(Y\) values
- For high-leverage values: the leverage statistic \(h_{i} = \frac{1}{n}+\frac{\left(x_{i}-\bar{x}\right)^{2}}{\sum_{i^{\prime}=1}^{n}\left(x_{i^{\prime}}-\bar{x}\right)^{2}}\)
  - \(h_i\) is always between \(1/n\) and 1, and the average leverage over all observations is always \((p + 1)/n\). So if a given observation has a leverage statistic that greatly exceeds \((p+1)/n\), we may suspect that the corresponding point has high leverage
  - we remove extreme \(X\) values
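The leverage formula can be verified against R's `hatvalues()`; a sketch with simulated data plus one artificially extreme \(x\) value:

```r
# Leverage by hand for simple regression vs hatvalues()
set.seed(2)
x <- c(rnorm(50), 8)          # append one far-out x value
y <- 1 + x + rnorm(51)
fit <- lm(y ~ x)

h_hand <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_hand)  # TRUE for simple regression
which.max(h_hand)   # 51: the artificial point has the largest leverage
```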
df_model_auto[, stud_res := MASS::studres(model_auto_vr)]
df_model_auto[, stud_res_pos := fifelse(stud_res >= 3 | stud_res <= -3, 'outlier', 'ok')]
studres_vr <-
ggplot(df_model_auto,
aes(x = horsepower,
y = stud_res,
color = stud_res_pos)) +
geom_point(alpha = 0.5, size = 3) +
geom_hline(yintercept = 0, colour = '#FF665A', size = 2) +
geom_hline(yintercept = 3, colour = '#FF665A', alpha = 0.75, size = 1) +
geom_hline(yintercept = -3, colour = '#FF665A', alpha = 0.75, size = 1) +
labs(title = latex2exp::TeX(r"($log(mpg) ~ \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2$ - Studentized Residuals)")) +
xlab('horsepower') + ylab('Studentized Residuals')
studres_vr
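For reference, `MASS::studres()` used above computes externally studentized residuals, the same quantity base R exposes as `rstudent()`; a quick sketch on simulated data:

```r
# MASS::studres and base R's rstudent agree
set.seed(4)
x <- rnorm(100)
y <- 2 + x + rnorm(100)
fit <- lm(y ~ x)

all.equal(unname(MASS::studres(fit)), unname(rstudent(fit)))  # TRUE
```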
We will remove the values identified as outliers and high leverage and compare the \(R^2\) for all models:
### No Outliers
df_model_auto_out = df_model_auto[stud_res_pos == 'ok']
model_auto_vr_out <- lm(log(mpg) ~ poly(horsepower, 2), data = df_model_auto_out)
sm_lin = summary(model_auto_linear)$adj.r.squared
sm_quad = summary(model_auto_quad)$adj.r.squared
sm_vr = summary(model_auto_vr)$adj.r.squared
sm_vr_out = summary(model_auto_vr_out)$adj.r.squared
data.table(linear = sm_lin,
quadratic = sm_quad,
var_constant = sm_vr,
no_outliers = sm_vr_out)
## linear quadratic var_constant no_outliers
## 1: 0.6049379 0.6859527 0.7309888 0.758253
### Artificially add a HLV
hlv = df_model_auto[1]
hlv$horsepower <- 350
df_model_auto_hlv = rbind(df_model_auto, hlv)
model_auto_vr_out_hlv <- lm(log(mpg) ~ poly(horsepower, 2), data = df_model_auto_hlv)
hats <- data.frame(hlv = hatvalues(model_auto_vr_out_hlv),
x = 1:nrow(df_model_auto_hlv))
ggplot(hats,
aes(x = x,
y = hlv)) +
geom_col(fill = '#6593A6', color = '#6593A6', alpha = 0.5)
resid = data.frame(residuals = model_auto_vr_out$residuals)
#create Q-Q plot for residuals
ggplot(resid,
aes(sample = residuals)) +
stat_qq(color = '#99BFB3', alpha = 0.5, size = 5) +
stat_qq_line(color = '#6593A6', size = 2) +
xlab('theoretical quantiles') +
ylab('sample')
# create residuals histogram
ggplot(resid,
aes(x = residuals)) +
geom_histogram(color = '#C3B2AF', fill = '#C3B2AF', alpha = 0.8) +
xlab('residuals') + ylab('')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
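Beyond eyeballing the Q-Q plot and histogram, a numeric normality check such as `shapiro.test()` can back up the visual impression; a sketch on simulated residual-like data:

```r
# Shapiro-Wilk normality test: Gaussian vs skewed samples
set.seed(8)
gaussian_resid <- rnorm(200)
skewed_resid <- rexp(200) - 1   # clearly right-skewed

shapiro.test(gaussian_resid)$p.value  # typically large: no evidence against normality
shapiro.test(skewed_resid)$p.value    # tiny: clearly non-Gaussian
```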